
Introduction to Python for Data Science

Lecture 02: Lists, Arrays, and DataFrames - Oh My!


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, the materials will be available for download at QUERCUS and also distributed via email. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into Python - when required. This learning approach will allow you to spend the time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank slate approach here to Python and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to get you from some potential scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this second lecture in a series of six. Today you will dive into more detailed data structures and the packages that work with them, and build up towards a more standard data science structure - the DataFrame.

At the end of this lecture we will aim to have covered the following topics:

  1. Python lists
  2. Python dictionaries
  3. Python tuples
  4. The NumPy package and arrays
  5. The Pandas package and DataFrames

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python

0.4.0 Data used in this lesson

Today's datasets will focus on using Python lists and the NumPy package

0.4.1 Dataset 1: aminoacids.csv

Just a simple amino acid table that we'll be importing later in lecture.


0.5.0 Packages used in this lesson

IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.

random is a package with methods to add pseudorandomness to programs

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.


1.0.0 Python data structures have properties and functions

As discussed in lecture 1, everything in Python is an object:

The above are all objects but also data types in Python. We can store these data types in data structures to properly store, format, and model our data. The decision of what data structure to use depends on the objective(s), data types, and the tasks to perform. For example, some data structures can handle only one data type at a time (all numeric or all character) but are computationally very fast. Other structures can store several data types but are computationally very expensive and slow, especially when we have large datasets (thousands of rows and columns).

Another feature to look for on data structures is their mutability; some structures can be altered after they are created (mutable), some are not (immutable). Let's take a look at some of Python's core data types (built-in data structures).

Further Reading: If you'd like to learn in more detail about what makes a data object mutable or immutable, you can find a good breakdown and set of examples from Megha Mohan

1.1.0 Lists are ordered collections of arbitrary data types

The first Python data structure that we will introduce is the list. Lists are ordered collections of data of one or several types (strings, booleans, etc.) where each datum is called an element or item. Lists are easily identifiable because of the square brackets that enclose their elements.

Caution: In other programming languages, Python's lists are usually called arrays, but be aware that Python has its own version of arrays as part of a package called NumPy (section 4.0.0 of this lecture).

1.2.0 A list can contain different data types

When working with different data types and information that you'd like to pass around, it's convenient to know that you can combine this information into a single list. As mentioned above, list elements can be of arbitrary types. So long as you remember the order of your elements, you can put a variety of data types into the same list object and they won't be subject to coercion.

Let's try!
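As a minimal sketch of the idea above (the values are made up for illustration and are not the lecture's own list), a single list can hold a string, an integer, a float, and a boolean without any of them being coerced:

```python
# A list holding four different data types (illustrative values)
mixed_list = ["biology", 42, 3.14, True]

print(type(mixed_list))      # <class 'list'>
print(type(mixed_list[0]))   # <class 'str'>
print(type(mixed_list[1]))   # <class 'int'>
print(type(mixed_list[2]))   # <class 'float'>
print(type(mixed_list[3]))   # <class 'bool'>
```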


1.2.1 Use list() to initialize an empty list

When writing flow control programs (lecture 4), we need to create empty structures in advance so the program has a place to write its output. This is called "initialization" and to create a list we use the list() function. In fact, all classes (which make objects) should have some kind of initializer to create an object even if they are essentially "empty" containers at their outset.

Remember you can inquire about the class of an object with type()
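A short sketch of initialization; the variable name empty_list is illustrative only:

```python
# Initialize an empty list; list() and [] give the same result
empty_list = list()

print(empty_list)         # []
print(type(empty_list))   # <class 'list'>
print(len(empty_list))    # 0
```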


1.3.0 Lists can be assigned to variables

So far we've just been making lists that disappear from memory but you may want to make a list that you can pass around, grow, shrink, or pull information from.


1.4.0 Access portions of a list using [ ]

Items in a list can be accessed using square brackets the same way we did with strings (lecture 1). Many of the same indexing methods and mechanisms work the same way as with strings. Recall we use the [index] syntax to access a single element of our list, which is also zero-indexed.

The items in a list can be modified by indexing the item that we want to change (remember that lists are mutable)

Mapping: The relationship between indices and items in data structures is one-to-one. Each index "maps to" one element.
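The indexing and in-place modification described above can be sketched as follows, using a hypothetical list of codons rather than the lecture's own variables:

```python
# Hypothetical list of codons
codons = ["GCA", "GCC", "GCG", "GCU"]

print(codons[0])    # GCA  (first element sits at index 0)
print(codons[-1])   # GCU  (negative indices count from the end)

# Lists are mutable, so an element can be replaced through its index
codons[1] = "gcc"
print(codons)       # ['GCA', 'gcc', 'GCG', 'GCU']
```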

1.5.0 Check for presence of elements using in

We can ask if a single element is present within a list using the in keyword. This can be a quick way to determine if your list has the element or item you are seeking.
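A minimal sketch of membership testing with in (and its negation, not in), using a made-up list:

```python
nucleotides = ["A", "C", "G", "T"]

has_g = "G" in nucleotides          # the element is present
has_u = "U" in nucleotides          # the element is absent
lacks_u = "U" not in nucleotides    # "not in" asks the opposite question

print(has_g, has_u, lacks_u)        # True False True
```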


1.6.0 Perform basic functions on lists

We can perform mathematical operations on lists and between lists. Let's explore list_2 using some built-in functions such as


1.7.0 Using operators on lists

Yes, like most data types you can use basic operators on lists but what will their behaviour be? Remember from last week that we saw different behaviours between numbers and strings! Let's explore further.

1.7.1 Use the + operator to concatenate two or more lists

Much like with strings, the + operator takes on a different behaviour when working with lists versus numbers. Rather than being interpreted as an addition operation, it is a concatenation symbol, allowing you to combine two or more lists together.


1.7.2 Use the * operator to repeat your list

Again we see a separate behaviour for a mathematical operator given the context of a list data structure. If you'd like to repeat your list one or more times, you can use the * operator.
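A short sketch of how + and * behave on lists, using illustrative values. Both operators build a new list and leave the originals untouched:

```python
list_a = [1, 2, 3]
list_b = ["x", "y"]

combined = list_a + list_b   # concatenation, not arithmetic addition
repeated = list_b * 3        # repetition, not arithmetic multiplication

print(combined)   # [1, 2, 3, 'x', 'y']
print(repeated)   # ['x', 'y', 'x', 'y', 'x', 'y']
print(list_a)     # [1, 2, 3]  (unchanged)
```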


1.7.3 Yes, you can slice a list too with [start:end]

I think we are seeing a pattern now between lists and strings (why do you think that is?). So we can slice our lists using the same notation we learned in lecture 1. Just remember that we are working with [inclusive:exclusive] form. So what does that really mean? Let's review.
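As a quick review of the [inclusive:exclusive] rule, here is a sketch on a made-up list of letters:

```python
letters = ["a", "b", "c", "d", "e", "f"]

print(letters[1:4])   # ['b', 'c', 'd']  index 1 is included, index 4 is excluded
print(letters[:3])    # ['a', 'b', 'c']  omitted start means "from the beginning"
print(letters[3:])    # ['d', 'e', 'f']  omitted end means "through the last element"
```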


1.7.3.1 use the random.seed() command to set up reproducible "random" sequences

Let's try something more "practical" by generating a random sequence of nucleotides and working with that. We'll introduce the seed(n) function from the random package. We'll also use the sample(sequence, n) function to take an item from our list without replacement. Remember how to access functions from packages?

Furthermore, we'll be seeing our first use of a for loop but we'll dig deeper into that in lecture 4 (flow control).
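A minimal sketch of seeding, assuming a hypothetical helper random_sequence() and an arbitrary seed of 2021 (neither comes from the lecture). It shows that re-using the same seed reproduces the same "random" sequence:

```python
import random

nucleotides = ["A", "C", "G", "T"]

def random_sequence(seed_value, length):
    """Build a pseudorandom list of nucleotides from a given seed (illustrative helper)."""
    random.seed(seed_value)
    sequence = []
    for _ in range(length):
        # random.sample() returns a list, so take its single element
        sequence.append(random.sample(nucleotides, 1)[0])
    return sequence

first_run = random_sequence(2021, 20)
second_run = random_sequence(2021, 20)

print(first_run == second_run)   # True: same seed, same sequence
```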

1.7.3.2 A note on simulating "random" events.

The only random things you'll find in computer science are whether your programs will run on the first try and whether you'll understand them 6 months from now.

More seriously, the ability to generate a truly random sequence of numbers is not possible. We can approximate randomness - especially with special hardware but from a software perspective we can only mimic stochastic processes. Generally our approximations or pseudorandom algorithms are, to the casual observer just as good as truly random events. They are however, deterministic and can be repeated if we know the start state of the process. Usually a random number generator might use something like the system time as a seed to initialize its state but if we use a specific seed, we can get repeatable results.

Slice the first 10 nucleotides which are CGAGACACGG

Why is the first cytosine missing?

Can you think of another way to index the first 10 nucleotides?

Analogously, omitting the second index returns all elements through to the last one.


1.7.3.3 Update your list with slicing

Slicing can also be used to update several elements at a time. Replace bases G and A (bases number 2 and 3 from gene_sequence) with R (R in a DNA sequence means that either adenine or guanine (puRines) can be found at that position).

Caution: Here is a reminder on the behavior of lists. Now that we've replaced some of the elements we can't undo it! Luckily we made a copy of the list when we made it.

Now, run the code below several times (press ctrl/command + enter 10 times consecutively) . Pay attention to the prints. Do you see what is happening?

Okay, let's go back to our task. Before that, though, go ahead and copy gene_sequence_copy one more time

Back to our task: to change the second and third nucleotides in gene_sequence we should start at index 1, not at index 2. Again: Python uses zero-based indexing.
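A sketch of a slice update on a short, made-up sequence (toy_sequence is a stand-in, not the lecture's gene_sequence). A copy is taken first so the original values remain available:

```python
toy_sequence = ["C", "G", "A", "G", "A", "C"]   # hypothetical sequence
toy_backup = toy_sequence.copy()                # independent copy, made before editing

# Bases number 2 and 3 live at indices 1 and 2, so the slice is [1:3]
toy_sequence[1:3] = ["R", "R"]

print(toy_sequence)   # ['C', 'R', 'R', 'G', 'A', 'C']
print(toy_backup)     # ['C', 'G', 'A', 'G', 'A', 'C']
```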

1.8.0 Use the join() method to concatenate list elements

Naturally, the most appropriate way to store a gene sequence is as a string, not as a list. Let's convert gene_sequence into a string with the join() method, which takes the form separator.join(sequence) where

Let's give it a try!
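A minimal sketch of separator.join(sequence) on an illustrative list of bases. Note that join() is called on the separator string, and every element of the list must itself be a string:

```python
bases = ["C", "G", "A", "G"]

dna_string = "".join(bases)    # empty separator: elements are glued together
dashed = "-".join(bases)       # any string can act as the separator

print(dna_string)   # CGAG
print(dashed)       # C-G-A-G
```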


1.9.0 Additional methods for working with lists

Time to move to other aspects of working with lists. Do you recall what methods are? Python objects, such as a list, have methods (behaviours) that can be applied to them to carry out functions. We access these with the . syntax much like functions from a package. In this case, however, we are accessing methods that belong to the object which is accessed through our variable. Thus our general syntax is variable.method().

Let's say that we want to add "biology" as an element to list_1. We know that the + operator concatenates two lists together, so we need to convert "biology" into a list and then concatenate the two lists together. Easy peasy, right?

1.9.0.1 Conversion of a string using list() breaks up your string!

As you can see above, our type-conversion or casting of the string "biology" resulted in a list of the single characters that make up the string. Not exactly the behaviour we wanted! Instead we could cast the string "biology" as a list with the [] operators using:

bio = list(["biology"]) or bio = ["biology"]

1.9.1 Use the .append() or .extend() methods to quickly add to your list without type-casting!

Rather than spend extra code instantiating a variable and type-casting it to a proper list, you can use the .append() method to add a single element to your list. To add multiple elements use the .extend() method instead.

Notice that list_1 was modified while list_2 remains intact
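The difference between .append() and .extend() can be sketched with made-up lists (not the lecture's list_1 and list_2):

```python
courses = ["genetics", "ecology"]
extra = ["statistics", "chemistry"]

courses.append("biology")   # adds one element to the end
courses.extend(extra)       # adds each element of another list
print(courses)   # ['genetics', 'ecology', 'biology', 'statistics', 'chemistry']
print(extra)     # ['statistics', 'chemistry']  (unchanged)

# Appending a list adds it as a single nested element
nested = ["genetics"]
nested.append(extra)
print(nested)    # ['genetics', ['statistics', 'chemistry']]
```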

1.9.2 Use the .sort() method on your list to order the elements

Another method, .sort(), lets you sort the elements in a list. As a caveat, .sort() requires all the elements to be of the same type in order to make proper comparisons. It can come in the form myList.sort(reverse = (True|False), key = MyFunc) where:

Note that as a method, .sort() will permanently change your list, just like .append() or .extend()!

.sort()'s default behavior is .sort(reverse = False) (lowest-to-highest) but it can be overwritten. How would you call on .sort() to give you the reverse order?

Argument: A parameter of a method or function that can be changed to alter its default behaviour. The default values of a method can be accessed using the help() function. Knowing the default behaviours of a method/function is very important, especially for statistical applications and regardless of what programming language or software you use.
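A short sketch of .sort() and its reverse and key arguments, using made-up values rather than the lecture's lists:

```python
values = [5, 100, 17, 3]

values.sort()                 # default: reverse=False, lowest-to-highest
print(values)                 # [3, 5, 17, 100]

values.sort(reverse=True)     # highest-to-lowest
print(values)                 # [100, 17, 5, 3]

# key takes a function that is applied to each element before comparing
names = ["Gly", "alanine", "Trp"]
names.sort(key=len)           # sort by string length
print(names)                  # ['Gly', 'Trp', 'alanine']

# .sort() changes the list in place and returns None
returned = values.sort()
print(returned)               # None
```

Because .sort() returns None, writing values = values.sort() would replace your list with None.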

1.9.3 Removing elements from the list

We've seen how to add and replace elements within your list but sometimes you want to delete an element completely. Let's look at some of the ways to accomplish this.

1.9.3.1 To remove and retrieve a deleted element, use the .pop() method

The .pop(index) method will return the removed element, while altering the list object. This way you can save that element, move or copy it to another list, or run analyses on the value/object as needed.

Caution: If no index is provided, the default behaviour of .pop() is to remove the last element of the list like popping the top (last) pancake off a stack of pancakes. It will do so without issuing any warnings at all.
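A minimal sketch of .pop() with and without an index, on an illustrative list:

```python
stack = [10, 20, 30, 40]

last_item = stack.pop()       # no index: removes and returns the last element
second_item = stack.pop(1)    # removes and returns the element at index 1

print(last_item)     # 40
print(second_item)   # 20
print(stack)         # [10, 30]
```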

1.9.3.2 Use the del statement to remove an element

If you are not interested in the removed value then you can use the del statement. Note how we aren't using the same kind of notation as above? del is a Python keyword rather than a list method, so it is written in front of the element to delete: del myList[index]. The parenthesised form del(myList[index]) also works, and del accepts slice notation as well!
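A short sketch of del on a single index and on a slice, with made-up values:

```python
numbers = [0, 10, 20, 30, 40, 50]

del numbers[0]       # remove the element at index 0
print(numbers)       # [10, 20, 30, 40, 50]

del numbers[1:3]     # remove a slice: indices 1 and 2 (values 20 and 30)
print(numbers)       # [10, 40, 50]
```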

1.9.3.3 Use the .remove() method to remove an element without knowing its index

More often than not, we tend to remember what element we want to drop but not its index. The .remove() method will find and delete the first occurrence of the element given as its argument.

Caution: If the element you've provided is not in the list, it will produce an error.

Let's remove the number 100 from list_2, this time with .remove():
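A minimal sketch of .remove() on an illustrative list that contains a repeated value. Checking with in first is one way to avoid the error for a missing element:

```python
readings = [100, 7, 100, 42]

readings.remove(100)     # only the first occurrence is deleted
print(readings)          # [7, 100, 42]

# Guard against a ValueError when the element may be absent
if 999 in readings:
    readings.remove(999)
print(readings)          # [7, 100, 42]  (nothing to remove)
```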


1.10.0 Use the .split() method to break a string into a list of strings

Sometimes you may have a complex string that you'd like to break up based on a pattern, specific character, delimiter etc. To accomplish this we can use the string method .split(). The resulting output returned, however, is a list of strings regardless of whether or not the pattern itself is found.

Now, let's subset "biology" from list_1, then use .split() to split "biology" wherever there is an "o".
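A sketch of .split() on the string "biology", shown here on a plain string variable rather than an element of list_1:

```python
word = "biology"

parts = word.split("o")      # split wherever an "o" occurs; the "o" itself is dropped
no_match = word.split("z")   # pattern not found: still a list, holding the whole string

print(parts)      # ['bi', 'l', 'gy']
print(no_match)   # ['biology']
```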


1.11.0 You can make a list within a list

As we mentioned in the beginning, a list can have an arbitrary grouping of data types/objects/elements. That means we can make a list within a list. This is also known as a nested list. When it comes to making more complex structures, it means we can provide a hierarchy to a structure using lists.

1.11.1 Use [ ][ ] to access individual elements of your nested list

In order to access specific parts of your list, you'll need to remember its structure and use that information. When there is an ordered pattern to your data structure, it can be easier to generate code for traversing the nested list. The [] operators will begin their access from the top-most level of the nested list and move downward through each level.

For a 2D nested list, you can think of it like a table or spreadsheet where you access using a [row][column] syntax.
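The [row][column] idea can be sketched with a small, made-up nested list (a stand-in for the lecture's list_aminoacids, whose exact contents are not reproduced here):

```python
# Each inner list is one "row": name, three-letter symbol, one-letter symbol
aa_table = [
    ["Alanine", "Ala", "A"],
    ["Aspartic acid", "Asp", "D"],
    ["Glycine", "Gly", "G"],
]

print(aa_table[1])       # ['Aspartic acid', 'Asp', 'D']  (whole second row)
print(aa_table[1][2])    # D  (second row, third column)
print(aa_table[0][0])    # Alanine
```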


1.12.0 Challenge 1: exploring additional list methods

You are interested in finding all "GCC" codons and their location in list_alanine. Tip: Run dir(list_alanine) to see a list of all the attributes and methods that are part of list list_alanine.

Caution: dir() unfortunately returns all of the attributes and methods from a class but it does not tell us which items listed are attributes vs. methods. You can, however, explore those more with help() or the getattr() functions.
Caution: .index() returns only the index of the first match (hit) it encounters. In the above code, a second instance of GCC would not be returned.
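One possible way around the first-match limitation, sketched on a hypothetical list of codons (not the lecture's list_alanine), is to combine .count() with a loop over enumerate(), which yields each index together with its element:

```python
codon_list = ["GCA", "GCC", "GCU", "GCC", "GCG"]   # hypothetical data

first_hit = codon_list.index("GCC")     # index of the first match only
n_hits = codon_list.count("GCC")        # how many times "GCC" occurs

# Collect every index whose element equals "GCC"
all_hits = []
for position, codon in enumerate(codon_list):
    if codon == "GCC":
        all_hits.append(position)

print(first_hit)   # 1
print(n_hits)      # 2
print(all_hits)    # [1, 3]
```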

1.12.1 A brief description of list methods

Here is a more detailed description of the list methods and what they do:

| Method call | Description | Alters the list? |
| --- | --- | --- |
| append() | Adds an element to the end of the list | Yes |
| extend() | Adds all elements of a list to another list | Yes |
| insert() | Inserts an item at the defined index | Yes |
| remove() | Removes the first matching item from the list | Yes |
| pop() | Removes and returns the element at the given index | Yes |
| clear() | Removes all items from the list | Yes |
| sort() | Sorts items in a list in ascending order | Yes |
| reverse() | Reverses the order of items in the list | Yes |
| index() | Returns the index of the first matched item | No |
| count() | Returns the number of occurrences of the item passed as an argument | No |
| copy() | Returns a shallow copy of the list (a new list object) | No |

2.0.0 Dictionaries are associative arrays

So far, we have seen that lists are very useful to store diverse types of data in a single structure. However, lists are not that convenient to use when we need to extract data in groups or elements that "belong together". Let's revisit our nested list list_aminoacids from section 1.11.0. What if you were interested in getting any data associated with aspartic acid? Do you remember what index Asp was at? Under this circumstance, we are better off using dictionaries.

Dictionaries are similar to lists and they can also take (almost) any data type. Unlike lists, however, dictionaries are not indexed by position. Instead, dictionaries map a key to a value (or a collection of values) related to that key, i.e. key:value pairs that belong together. In our example, "Aspartic acid" is the key and "Asp" is its value.

Dictionaries are created by using { } (curly brackets), a feature that also makes them easily identifiable. Unlike list elements, though, dictionary keys must be immutable objects and cannot be accessed by position. (Since Python 3.7 dictionaries do remember the order in which keys were inserted, but they still have no numeric index.) As a side note, Python dictionaries are usually called hashes, hash maps, or associative arrays in other programming languages. In summary:

For starters, let's create a dictionary called aminoacids_dict.

Now that we created a dictionary, let's access its keys and values

What is the problem? I am 100% certain that alanine is in that list...
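One common reason a lookup fails even though the entry seems to be there is that keys must match exactly, including case. A minimal sketch with a made-up two-entry dictionary (not the lecture's aminoacids_dict):

```python
toy_dict = {"Alanine": "Ala", "Aspartic acid": "Asp"}

print(toy_dict["Aspartic acid"])    # Asp
print(list(toy_dict.keys()))        # ['Alanine', 'Aspartic acid']

exact_match = "Alanine" in toy_dict    # key spelled exactly as stored
wrong_case = "alanine" in toy_dict     # lowercase "a": a different string

print(exact_match)   # True
print(wrong_case)    # False
```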

2.0.1 Use the dict() function to initialize an empty dictionary

Analogously to lists, empty dictionaries can be created with the function dict() or just with { }. These can be assigned to variables when needed.

Caution: Dictionaries do not allow repeated keys. If you enter a key:value pair whose key already exists, only the value will be updated.

2.1.0 Adding entries to a dictionary

2.1.1 Add single entries using the = operator

Single entries can be added to dictionaries using the = (equal) operator with the syntax dictionary[key] = value.


2.1.2 Add multiple entries with the .update() method

If you have more than one entry you'd like to add to your dictionary at once, you can essentially create a second dictionary to add through the .update() method. Remember that {key1:value1, key2:value2} initializes a new dictionary.
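A sketch of both ways of adding entries, using a hypothetical dictionary named codes:

```python
codes = dict()                      # start with an empty dictionary

codes["Alanine"] = "Ala"            # add a single entry
codes.update({"Glycine": "Gly", "Cysteine": "Cys"})   # add several at once
print(codes)    # {'Alanine': 'Ala', 'Glycine': 'Gly', 'Cysteine': 'Cys'}

# Re-using an existing key overwrites its value instead of adding an entry
codes["Alanine"] = "A"
print(codes["Alanine"])   # A
print(len(codes))         # 3
```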

2.2.0 Changing the values within a dictionary

2.2.1 Make a dictionary containing sets as values

Now, let's rebuild our aminoacids database, this time with their respective symbols and encoding codons stored as sets. The main differences between sets and lists are:


As you can see, Python objects can be overwritten so be careful when you create new variables. Immutability, in the case of dictionaries, makes reference to the fact that keys can be added or deleted but not changed.

2.2.2 Alter key:value pairs with the .update() method as well

Not only can you add new entries with .update() but you can also (as seen in our previous example) update the key:value pairs within a dictionary.


2.2.3 Use the = operator to also update key:value pairs

As with .update(), we can directly access a key:value pair and alter the value using the dictionary[key]=value syntax.

Let's bring back the Alanine values 'Ala', 'A', and 'GCA GCC GCG GCU'


2.3.0 Remove values from a dictionary with del or .pop()

Much like lists, we can remove entries from the dictionary directly with the del statement or one at a time using the .pop() method, which also returns the removed value. These operate in much the same way as they do for lists, except that you supply a key instead of an index.
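A minimal sketch of removing dictionary entries by key, using made-up data. The second argument to .pop() is an optional default returned when the key is missing:

```python
codes = {"Alanine": "Ala", "Glycine": "Gly", "Cysteine": "Cys"}

del codes["Glycine"]                   # remove the entry, discard the value
removed = codes.pop("Cysteine")        # remove the entry, keep the value
missing = codes.pop("Serine", None)    # absent key: the default is returned

print(removed)   # Cys
print(missing)   # None
print(codes)     # {'Alanine': 'Ala'}
```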


2.4.0 Additional dictionary functions

There are a number of additional dictionary methods that can provide information about or alter a dictionary object. For example len() gives us the number of key:value pairs in a dictionary.

Under the hood: len(s) is a Python core function that can be used on many objects like lists, strings, and dictionaries. Each structure, s, has its own method __len__() which len() calls on.

Let's try out the len() function.

2.4.1 Challenge 2: Determine if a key is present in our dictionary

We are missing STOP and START codons in our dictionary. Use Python code to demonstrate that STOP is not present in dictionary_aminoacids. Tip: Your answer should be a boolean.

2.4.2 Challenge 3: Retrieve all of the values associated with a key

Use .get() to retrieve all values associated with glutamic acid. Tip: Use help() to find out more about the usage of .get(). If that is not enough, look it up on the internet.


2.4.3 Use the .values() method to extract values only from the dictionary

If you need to extract only the values from a dictionary, use the method .values(). This will return a dict_values object but it can also be cast as a list object.
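A short sketch of .values() and its conversion to a list, on an illustrative dictionary:

```python
codes = {"Alanine": "Ala", "Glycine": "Gly", "Cysteine": "Cys"}

values_view = codes.values()
print(type(values_view).__name__)   # dict_values

values_list = list(values_view)     # cast to a regular list
print(values_list)                  # ['Ala', 'Gly', 'Cys']
```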


2.5.0 Use a list element as a dictionary key

All of our examples so far have looked at the types of objects that can be used as values. To remind you, the keys of a dictionary are made of immutable types such as integers, strings, booleans, or tuples (coming up!), but not lists or dictionaries - remember these are mutable.

Read more: You can find more details about dictionary implementation here

However, elements of a list can be passed on as keys to a dictionary provided that they too are immutable elements.
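A sketch of the distinction, with made-up lists: the strings stored inside a list are valid keys, while the list itself is rejected with a TypeError:

```python
names = ["Alanine", "Glycine"]
symbols = ["Ala", "Gly"]

# Strings pulled out of a list are immutable, so they work as keys
lookup = {names[0]: symbols[0], names[1]: symbols[1]}
print(lookup)    # {'Alanine': 'Ala', 'Glycine': 'Gly'}

# The list itself is mutable (unhashable) and cannot be a key
error_message = ""
try:
    bad = {names: "two names"}
except TypeError as error:
    error_message = str(error)
print(error_message)
```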


2.6.0 Dictionaries can also have nested structures

Just like lists, you could use dictionaries as values within your dictionaries, thus generating nested dictionaries. Like nested lists, you can access keys at each level with successive use of the [key] syntax.
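A minimal sketch of a nested dictionary; the inner keys "symbol" and "letter" are illustrative choices, not the lecture's:

```python
nested_codes = {
    "Alanine": {"symbol": "Ala", "letter": "A"},
    "Glycine": {"symbol": "Gly", "letter": "G"},
}

print(nested_codes["Alanine"])             # {'symbol': 'Ala', 'letter': 'A'}
print(nested_codes["Alanine"]["letter"])   # A
print(nested_codes["Glycine"]["symbol"])   # Gly
```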


2.6.1 Dictionaries cannot be sliced with [ ] notation

Unlike lists, which are indexed by position, dictionary keys have no positional index. Therefore you cannot access specific elements by their "position" within the dictionary. Consequently, you cannot use the slice notation [start:end] with dictionaries either.


2.7.0 A summary of dictionaries

To summarize, lists and dictionaries are very similar but have three key differences: indexation, order, and mutability. If you need fast indexation to look for unique keys and their values, go with dictionaries.

Run dir(dict) for more methods on dictionaries.


3.0.0 Tuples are immutable lists

We've already mentioned the concept of tuples but haven't discussed clearly what these objects are. Also known as structured arrays, the tuple is, simply put, an immutable list. Let's compare them to lists for a better understanding.

Time to make a tuple with multiple elements:

If you try to change an element in a tuple, you get a traceback (Python error). Let's make Alanine all lowercase:


3.0.1 Initialize using the tuple() function

Like dictionaries, the immutability of tuples makes them good candidates to reliably store data that you do not want to be changed (like by accident). Similarly to lists, the function tuple() will split a single string into its component characters and will store them as elements:

3.1.0 Tuples only have two methods: .count() and .index()

As mentioned above, given the immutability of the tuple, it doesn't need some of the same methods as a list. As a consequence, tuples have only two methods: .count() and .index()

3.1.1 Work around immutability with lists and the .append() method

Methods such as .append() do not operate on tuples because of their immutability. A workaround to append elements to a tuple is to first create a list, append elements to it, and then use type conversion to make a tuple.
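The workaround can be sketched as follows, with a made-up tuple. The original tuple is never modified; a new, longer tuple is created instead:

```python
alanine = ("Alanine", "Ala", "A")

# Direct modification fails because tuples are immutable
assignment_failed = False
try:
    alanine[0] = "alanine"
except TypeError:
    assignment_failed = True

# Workaround: convert to a list, append, convert back to a tuple
as_list = list(alanine)
as_list.append("GCA")
alanine_extended = tuple(as_list)

print(assignment_failed)   # True
print(alanine_extended)    # ('Alanine', 'Ala', 'A', 'GCA')
print(alanine)             # ('Alanine', 'Ala', 'A')
```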


3.2.0 Retrieve elements from a tuple by index

Accessing tuple elements is the same as a list and you can use the [] notation to assign elements to individual variables one by one.

You can also use multiple assignment (remember lecture 1?) to assign several variables at once. Just be sure that each side balances out! Note that you can also use slice notation on a tuple.

Python trick: Variables can be swapped with the syntax a, b = b, a. That will assign a to b and b to a.
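A short sketch of multiple assignment, tuple slicing, and the swap trick, using illustrative values:

```python
record = ("Alanine", "Ala", "A")

# Multiple assignment: three names on the left, three elements on the right
name, symbol, letter = record
print(name, symbol, letter)   # Alanine Ala A

# Slicing a tuple returns a new tuple
print(record[0:2])            # ('Alanine', 'Ala')

# Swap two variables in a single line
a, b = 1, 2
a, b = b, a
print(a, b)                   # 2 1
```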

3.3.0 Use the .items() method in dictionaries to make tuples

Jumping back quickly into dictionaries, you may want to preserve or move around a key:value pair. You can do so using the .items() method which will return a dict_items object. From there, you can use tuple() to cast it over to a tuple object either individually, or as a tuple of tuples.
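A minimal sketch of .items() and its conversion to a tuple of tuples, on a made-up dictionary:

```python
codes = {"Alanine": "Ala", "Glycine": "Gly"}

items_view = codes.items()
print(type(items_view).__name__)   # dict_items

pairs = tuple(items_view)          # a tuple of (key, value) tuples
print(pairs)                       # (('Alanine', 'Ala'), ('Glycine', 'Gly'))

first_pair = pairs[0]              # a single key:value pair as a tuple
print(first_pair)                  # ('Alanine', 'Ala')
```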


4.0.0 Introduction to NumPy

Built-in Python objects are very versatile: they can hold different data types, you can do some basic mathematical operations with them, they can be mutable or not, and much more. However, they lack a key feature: the ability to perform simple and advanced mathematical operations on data structures in ways that are time- and computationally efficient.

NumPy (Numeric Python) is a package for scientific computing developed by Travis Oliphant and first released in 2005, based on a pre-existing Python package called Numeric. NumPy has advanced and changed a lot since its first release, all thanks to a very active community of programmers that have contributed their time and effort to improve NumPy.

In terms of data structures, NumPy offers an alternative to built-in lists called NumPy arrays, which allow mathematical operations across one- or multi-dimensional arrays. Let's start diving into NumPy's functionality. First, though, we need to install NumPy and import it as np (not required but it is the standard).

4.1.0 One-dimensional NumPy arrays

A one-dimensional (1D) array object has all data as a single row. Sounds like a list, right? These structures, however, can only contain a single data type and will perform coercion without any warnings. Let's create our first NumPy arrays.

4.1.1 Use the .shape property to retrieve array dimensions

If you're not sure how many elements, columns, rows etc., you have in your array, you can quickly retrieve it using the .shape property. We'll talk more about what a property is in section 5.2.0.
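A sketch of 1D arrays, silent coercion, and .shape, assuming NumPy is installed and using made-up values. The .dtype attribute, which reports the single data type of an array, is used here to make the coercion visible:

```python
import numpy as np

int_array = np.array([1, 2, 3, 4])
print(int_array.shape)          # (4,)  one dimension holding 4 elements

# Mixing integers and floats silently coerces everything to float
float_array = np.array([1, 2.5, 3])
print(float_array.tolist())     # [1.0, 2.5, 3.0]

# Mixing numbers and strings silently coerces everything to strings
text_array = np.array([1, "two", 3.0])
print(text_array.dtype.kind)    # U  (unicode string)
```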

4.2.0 Mathematical operations on arrays are performed element-wise

Remember that using operators like + and * on lists produces behaviours related to concatenation of the list objects themselves. With arrays, we can instead perform element-wise math operations if the arrays are the same size and if the elements are suitable for math operations.

If we add up two NumPy arrays, their elements will be added up element-wise. The same rules apply as above
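A minimal sketch contrasting element-wise array arithmetic with list concatenation, using illustrative values:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

print((x + y).tolist())   # [11, 22, 33]  element-wise addition
print((x * y).tolist())   # [10, 40, 90]  element-wise multiplication
print((x * 2).tolist())   # [2, 4, 6]     every element is doubled

# The same + on plain lists concatenates instead
print([1, 2, 3] + [10, 20, 30])   # [1, 2, 3, 10, 20, 30]
```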

4.2.1 Adding a list to a NumPy array coerces the list

Adding a list and a NumPy array results in the list behaving as an array (coercion) to complete the operation. Remember that the addition has to make sense to the interpreter too, and a list is simply coerced to an array object. Will its contents be coerced as needed?

Caution: NumPy arrays may seem a lot like lists but don't expect core python operators to behave as dynamically with them as we see with base core objects like numbers vs strings vs lists!
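A short sketch of adding a list to an array, with made-up values. The result is an array, and the addition is element-wise rather than a concatenation:

```python
import numpy as np

array_part = np.array([1, 2, 3])
list_part = [10, 20, 30]

mixed_sum = array_part + list_part    # the list is treated as an array

print(type(mixed_sum).__name__)       # ndarray
print(mixed_sum.tolist())             # [11, 22, 33]
```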

4.3.0 NumPy arrays are also data structures

Just like lists, tuples, and dictionaries, NumPy arrays are data structures that have their own properties and methods. As you've seen, we initialize arrays with the np.array() function. Similar to lists, we can use [] notation to subset and slice these with the expected behaviour.


4.4.0 Use conditional operators on arrays

Up until now we haven't really used conditional operators but we can quickly ask certain objects about their elements and if they fulfill a condition with a True or False result. With NumPy arrays, we can perform these conditional statements in an element-wise manner.

4.4.1 Retrieve values with conditional statements

Instead of just looking at an array of booleans, which can get quite long in large data sets, you can instead feed your conditional result back into the original array to retrieve the values that return True. This effectively gives you an array of actual element values that you might "want" to work with. We achieve this with the syntax array[conditional statement]; in the case above, array_1 > 17 is our conditional statement. Let's try!
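A minimal sketch of the array[conditional statement] pattern; the array toy_array and its values are made up and stand in for the lecture's array_1:

```python
import numpy as np

toy_array = np.array([12, 18, 25, 17, 30])

mask = toy_array > 17              # element-wise comparison gives booleans
print(mask.tolist())               # [False, True, True, False, True]

selected = toy_array[mask]         # keep only the elements where mask is True
print(selected.tolist())           # [18, 25, 30]

# The condition can also be written directly inside the brackets
print(toy_array[toy_array > 17].tolist())   # [18, 25, 30]
```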

4.4.2 Negate the values of your conditional array with ~

In Python there are a number of ways to negate booleans (create the opposite value). With the NumPy package and arrays in particular we can use the ~ operator to perform a bitwise negation on the array. Like with all objects, however, if used outside this context the behaviour of the operator may not be as expected!
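A sketch of ~ on a boolean array, with made-up values, followed by an example of the same operator on a plain integer, where it performs a bitwise NOT rather than a logical negation:

```python
import numpy as np

toy_array = np.array([12, 18, 25, 17, 30])
above_17 = toy_array > 17

print((~above_17).tolist())          # [True, False, False, True, False]
print(toy_array[~above_17].tolist()) # [12, 17]  values NOT greater than 17

# Outside boolean arrays, ~ flips the bits of an integer
print(~5)                            # -6
```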

4.5.0 Two-dimensional arrays

Put simply, a 2D NumPy array has rows and columns (hence the 2D name). Similar to their 1D counterparts, all data in a 2D array must be of the same type. We can make 2D arrays by combining two individual lists of the same size.

or by type-converting a nested list

4.5.1 Subset 2D NumPy arrays with [row][column] notation

To subset 2D arrays we use two sets of square brackets, one for rows and one for columns, which looks like my_array[row][column].

Caution: Subsetting arrays with only one index defaults to rows. E.g., my_array[0] returns row 0.

Let's select the element "15" from array_2.

4.5.2 Subset 2D NumPy arrays with [row, column] notation

The same result as above can be achieved using one set of square brackets with a comma that separates rows from columns, which looks like my_array[row, column].

4.5.3 Use slice notation to select subsets of your array

That's right! Slicing has been implemented for arrays so you can pull out proper subsets unlike nested lists! For example, to select all rows or columns, use a : (colon) at the respective side of the comma. Let's practice some slicing.
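The subsetting and slicing forms for 2D arrays can be sketched on a small, made-up 3 x 3 array (not the lecture's array_2):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

print(grid[1][2])                # 6  [row][column]
print(grid[1, 2])                # 6  [row, column]
print(grid[0].tolist())          # [1, 2, 3]  a single index selects a row
print(grid[:, 0].tolist())       # [1, 4, 7]  all rows, first column
print(grid[0:2, 1:3].tolist())   # [[2, 3], [5, 6]]  rows 0-1, columns 1-2
```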


4.6.0 Using NumPy for statistical applications

So far we have been doing very basic operations to experiment with Python objects. For that reason, we have been using small datasets. In reality, we get big datasets that need to be analyzed in ways that are efficient and reliable.

In order to explore NumPy's statistical capabilities, we are going to simulate a more complex, very popular dataset in the data science world called iris. We will showcase some exploratory data analysis capabilities but we encourage you to look into R to do advanced stats and data visualization (CAGEF also has an introductory R course!).

Let's create the iris dataset


4.6.1 Use the mean() function on arrays

Remember that both arrays and many of the statistical functions we'll be introducing are part of the NumPy package. That being said, there are implicit expectations regarding behaviours and attributes present in objects like arrays. Without getting too much into the philosophy, it can make the introduction of additional functions to a package both more flexible and rigid.

Let's take a closer look at the mean() function. Looking at the documentation we see the following information numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>) which we can break down to:

Let's start with the basic use of mean() and build up from there. First, we can calculate the average for sepal length (column 1)

4.6.1.1 Using the axis parameter to calculate the mean across dimensions

As you can see from above, when we use the default behaviour of mean(), it treats an array, regardless of dimension, like a flat list of numbers and takes the mean of the whole set. What if we want to calculate the mean across rows or columns? Intuitively it can be a little confusing, but recall that we use [row, column, etc] notation to access arrays. If we are working with an array with n rows and m columns, then axis=0 collapses the rows and returns m means (one per column), while axis=1 collapses the columns and returns n means (one per row).

Another way to think about it is that axis=0 returns a row of means, and axis=1 returns a column of means.

Note that we aren't even talking about multi-dimensional arrays where axis can also be assigned as a tuple to identify different dimensions you'd like to perform the calculation across.
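The effect of the axis parameter is easiest to see on a tiny array. The sketch below uses a made-up 3 x 2 array (measurements is a hypothetical name) rather than the iris data.

```python
import numpy as np

# Hypothetical array with n = 3 rows and m = 2 columns
measurements = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])

overall = np.mean(measurements)               # flat mean of all 6 values
column_means = np.mean(measurements, axis=0)  # collapse rows: one mean per column
row_means = np.mean(measurements, axis=1)     # collapse columns: one mean per row

print(overall)        # 3.5
print(column_means)   # [3. 4.]
print(row_means)      # [1.5 3.5 5.5]
```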


4.6.2 Other functions available in NumPy

We've been looking at just the mean() function but we can achieve different calculations with similar behaviours from
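A few NumPy functions that follow the same calling pattern as mean() are sketched below. The selection (median, min, max, sum, std) is our own choice for illustration and not necessarily the list used in the lecture; the array is made up.

```python
import numpy as np

# Hypothetical 3 x 2 array
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 9.0]])

# Each function accepts the same axis argument as np.mean()
col_median = np.median(values, axis=0)
col_min = np.min(values, axis=0)
col_max = np.max(values, axis=0)
col_sum = np.sum(values, axis=0)
col_std = np.std(values, axis=0)   # population standard deviation by default

print(col_median)   # [3. 4.]
print(col_min)      # [1. 2.]
print(col_max)      # [5. 9.]
print(col_sum)      # [ 9. 15.]
print(col_std)
```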


4.6.3 Linear algebra and transpose() with arrays in NumPy

Sometimes you'd like to manipulate the shape or order of elements in your arrays. A common method that you may wish to perform is the transpose() method. This is a method available through the array object and it will take care of all of the details even for multi-dimensional arrays. It's actually part of the array manipulation set of routines.

For now we'll stick with the straightforward 2-dimensional array.
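A minimal sketch of transposing a 2-dimensional array, using a made-up 2 x 3 array (original and flipped are hypothetical names):

```python
import numpy as np

# Hypothetical 2 x 3 array
original = np.array([[1, 2, 3],
                     [4, 5, 6]])

# transpose() swaps rows and columns; original.T is a shorthand for the same thing
flipped = original.transpose()

print(original.shape)   # (2, 3)
print(flipped.shape)    # (3, 2)
print(flipped)
```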


4.6.4 Calculate the inverse of a matrix

You may remember from first- or second-year algebra the following formula

$$ A \times A^{-1} = I $$

Where $A$ is an invertible square matrix, we can find its inverse $A^{-1}$ such that multiplying them produces the identity matrix $I$, which has all 0s except for a diagonal line of 1s.

To calculate the inverse matrix, we can use the function inv() from the linalg module of NumPy.

Now what? Why the "SyntaxError: unexpected EOF while parsing"? What is EOF?

EOF stands for "end of file", and it came up because we missed a closing parenthesis at the end of the code.
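With every parenthesis closed, a call to inv() might look like the sketch below. The 2 x 2 matrix is made up for illustration (matrix_a is a hypothetical name), and np.dot() is used to check the result against the identity matrix.

```python
import numpy as np

# Hypothetical invertible 2 x 2 matrix (determinant = 4*6 - 7*2 = 10)
matrix_a = np.array([[4.0, 7.0],
                     [2.0, 6.0]])

# Note the matching opening and closing parentheses
inverse_a = np.linalg.inv(matrix_a)

# Multiplying a matrix by its inverse should give the identity matrix
identity_check = np.dot(matrix_a, inverse_a)

print(inverse_a)
print(np.allclose(identity_check, np.eye(2)))   # True
```

np.allclose() is used instead of == because floating-point arithmetic can leave tiny rounding differences. A matrix with no inverse raises a numpy.linalg.LinAlgError instead.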


4.6.5 Calculate the dot-product with the dot() function

The dot-product of two matrices calculates the sum of the product of elements between rows of matrix A and columns of matrix B. Sound familiar?

dot-product.jpg https://algebra1course.wordpress.com/2013/02/19/3-matrix-operations-dot-products-and-inverses/
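The row-by-column arithmetic can be written out for two small made-up matrices (matrix_a and matrix_b are hypothetical names):

```python
import numpy as np

# Two hypothetical 2 x 2 matrices
matrix_a = np.array([[1, 2],
                     [3, 4]])
matrix_b = np.array([[5, 6],
                     [7, 8]])

product = np.dot(matrix_a, matrix_b)

# Each entry is (row of A) times (column of B), summed:
# [[1*5 + 2*7, 1*6 + 2*8],
#  [3*5 + 4*7, 3*6 + 4*8]]
print(product)   # [[19 22]
                 #  [43 50]]
```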

There's more to explore!: Check out more documentation on NumPy here

5.0.0 Pandas gives us the DataFrame object

To put it in context, Pandas expands NumPy capabilities in the same way that NumPy expands Python's. Pandas is a data manipulation tool developed by Wes McKinney, built on NumPy to simplify working with tabular datasets.

We'll cover this in more detail next week, but in properly formatted tabular datasets, each column is a variable (a parameter that was measured) and each row is a set of observations (the results of quantitatively or qualitatively measuring each parameter).

There are two data structures in Pandas that we are interested in:

| Structure | Description | Characteristics |
|---|---|---|
| Series | A 1-dimensional array-like structure | Contains a single data type; values are mutable but size is not |
| DataFrame | A 2D labeled, tabular container for Series objects | Resembles a spreadsheet; size-mutable by adding columns |

Here is some information about Pandas library architecture:

Other architectures include sparse (missing values), stats (statistical applications), util (testing and debugging tools), and rpy (R2Py, connectivity with R programming language).

First, install pandas using pip


5.1.0 Create a DataFrame from a dictionary object

We can start our journey into Pandas by creating data frames out of dictionaries! Recall that these use a key:value structure. We can use keys as variables (columns) and values as observations (rows).

First, let's build a dictionary

5.1.1 Use the DataFrame() function

Hard to read, right? It will look much better to human eyes if we convert it into a DataFrame object. We'll use the function DataFrame() to accomplish our goal and we'll also introduce a way to take a quick look at your data with the .head() method. This will allow us to look at a specified number of rows from the beginning of our DataFrame. In this case, the default number of rows is 5.

Take Note! We'll be using the variable_df suffix notation in variable names to identify them as DataFrame objects.

Notice how Jupyter has formatted the DataFrame for us into a nice readable table? Convenient!
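For reference, the dictionary-to-DataFrame step might look like the sketch below. The dictionary contents and the column names Aminoacid, Abbreviation, and Letter are assumptions made for illustration; the dictionary built in class may differ.

```python
import pandas as pd

# Hypothetical dictionary: keys become columns, each list becomes the rows
aminoacid_dict = {
    "Aminoacid": ["Alanine", "Cysteine", "Glycine", "Serine", "Valine", "Leucine"],
    "Abbreviation": ["Ala", "Cys", "Gly", "Ser", "Val", "Leu"],
    "Letter": ["A", "C", "G", "S", "V", "L"],
}

aminoacid_df = pd.DataFrame(aminoacid_dict)

print(aminoacid_df.head())    # first 5 of the 6 rows (the default)
print(aminoacid_df.head(2))   # first 2 rows only
```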

5.2.0 Properties of objects

Recall that objects can be composed of attributes (values) and methods. Up until now we have been collecting information or working with objects through their methods. Depending on who has implemented the code for your object and the language you are using, most attributes are (by good practices) private. That means that under the hood, you can't simply alter these attributes directly. Instead, you'll call on helper methods to accomplish this. Python has developed a specific implementation whereby if you are allowed to access/alter an attribute it will likely be of the property object. Defining attributes this way, you can define specific ways to get and set information about them (with failsafes!).

More importantly, all of the details are hidden from the user and we can simply access these properties with the .property syntax. While it may look like we are directly accessing an attribute, we are not.

5.2.1 Use the .shape property to retrieve dimension information

How many rows and columns do we have? That question can be answered by retrieving the .shape property which will return a pair of values in the format of (row, column).


5.2.2 Use the .index property to retrieve/add/change row names in your DataFrame

From above you may notice a single unnamed column in our DataFrame. It looks like it is using the potential indices of our rows as labels. These are the row names of our data frame and they can be altered to carry meaningful information like amino acid abbreviations instead.

To alter our indices we can access the property .index and assign it directly to a new list. Now, let's add indices (row names) to our data frame.


5.2.3 Use the .columns property to retrieve/change column names in your DataFrame

Much like the .index property, we can use the .columns property to get and set our column names.
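The three properties from this section can be tried together on a small DataFrame. This is a sketch on made-up data; the column names and the new labels are hypothetical.

```python
import pandas as pd

# Hypothetical 3-row DataFrame
aminoacid_df = pd.DataFrame({
    "Aminoacid": ["Alanine", "Cysteine", "Glycine"],
    "Letter": ["A", "C", "G"],
})

print(aminoacid_df.shape)      # (3, 2): 3 rows, 2 columns

# Replace the default 0, 1, 2 row labels with abbreviations
aminoacid_df.index = ["Ala", "Cys", "Gly"]

# Replace the column names; the list length must match the number of columns
aminoacid_df.columns = ["name", "one_letter_code"]

print(aminoacid_df)
```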


5.3.0 Don't type in your DataFrames, import them with read_csv()!

Now we have a Python object that resembles a spreadsheet. However, creating data frames this way is tedious and very error-prone. Small mistakes can pop up, and what happens if you have hundreds of thousands of rows of data?

The more sensible way to produce such large objects is by reading in files using functions like Pandas' read_csv(). CSV stands for "comma-separated values" and is a type of text file like the TSV (tab-separated values) and other delimiter-separated values. Pandas will automatically store all of this information as a DataFrame object during import.

Let's import our first data file in Python using the aminoacids.csv file in this lecture's data directory.


5.3.1 Use the correct parameters to import your data

Column 0 is not automatically recognized as the indices so we need to explicitly state it using the right parameter. If you use help(pd.read_csv) you'll see that there are many parameters that can be assigned during our import call.

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], sep=<object object at 0x0000018D08E26E50>, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options: Union[Dict[str, Any], NoneType] = None)
    Read a comma-separated values (csv) file into DataFrame.

For our purpose we want to use the index_col parameter, whose default value is None, to let the function know that there is an index column located in column 0.
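The difference index_col makes can be seen without the course file. The sketch below reads a small made-up CSV from a string with io.StringIO; the contents are hypothetical and only stand in for aminoacids.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column holds the row labels
csv_text = """,Aminoacid,Letter
Ala,Alanine,A
Cys,Cysteine,C
Gly,Glycine,G
"""

# Default import: column 0 is read as an ordinary (unnamed) data column
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, column 0 becomes the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.shape)        # (3, 3)
print(indexed_df.shape)        # (3, 2)
print(list(indexed_df.index))  # ['Ala', 'Cys', 'Gly']
```

With a real file, the first argument would be the path to the file rather than a StringIO object.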

5.4.0 Other helpful DataFrame methods

So far we've seen some helpful properties for retrieving and altering attributes within our DataFrame objects. We've also seen the use of .head() to look at the first n rows of our DataFrames. Similarly, you can view the last n rows of your DataFrame with .tail() and you can retrieve an overall summary of the DataFrame with the .info() method.


5.5.0 Subset DataFrame and Series objects with the [ ] notation.

Pandas too can be subset with the [] syntax but in a limited fashion depending on the object:

We can pull information out using this notation in a number of ways, such as using the column names directly. Let's retrieve the column Aminoacid and see what kind of object is returned.

5.5.1 Subsetting a DataFrame by column returns a Series

A Pandas Series is a 1D-labeled NumPy array that makes up rows and columns in data frames. Though they inherit much of their structure from NumPy arrays, Pandas Series objects have their own attributes such as .values, which allows you to access the data contained in a Series as a NumPy ndarray object.


5.5.2 Subset multiple columns in a DataFrame by providing a list

The values in a Series, as well as data frame rows and columns, are 1D NumPy arrays. By default, using [] to access a single column returns a Series object.

If, however, we want to access a sub-portion of a DataFrame we can also provide a list. Here's where the notation can get funny because we define a list using [] as well! Therefore, to subset multiple columns we need to use syntax that looks like dataFrame[["colName1", "colName2", "colNameN"]].

To top it all off, providing a list to subset your DataFrame will always return a DataFrame object. You can perform similar operations on a Series object as well!
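The difference between [] and [[]] shows up in the type of object that comes back. This sketch uses a made-up DataFrame; the Codons column (number of codons per amino acid) is a hypothetical addition used to show .values on numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame
aminoacid_df = pd.DataFrame({
    "Aminoacid": ["Alanine", "Cysteine", "Glycine"],
    "Letter": ["A", "C", "G"],
    "Codons": [4, 2, 4],
})

single = aminoacid_df["Aminoacid"]                 # one name: a Series
as_frame = aminoacid_df[["Aminoacid"]]             # list with one name: a DataFrame
two_cols = aminoacid_df[["Aminoacid", "Letter"]]   # list of names: a DataFrame

print(type(single).__name__)     # Series
print(type(as_frame).__name__)   # DataFrame
print(two_cols.shape)            # (3, 2)

# .values exposes the data of a numeric Series as a NumPy ndarray
codon_values = aminoacid_df["Codons"].values
print(isinstance(codon_values, np.ndarray))   # True
```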


5.5.3 Use the .loc and .iloc advanced data access methods for rows and columns

As you can see above, the standard [] notation can grant us access to parts of a DataFrame or Series, but it isn't particularly optimized for this. The Pandas package, however, provides optimized access using the .loc and .iloc methods, which behave similarly to [], so remember the difference between [] and [[]]!

| Method | Description | Examples to put within [] |
|---|---|---|
| .loc | Used primarily for accessing using labels and will search for matches in this attribute | ['a', 'b', 'c'] or 'a':'f' |
| .iloc | Used primarily for accessing with integer positions | [4, 3, 0] or 1:7 |

Note that both of these methods also accept an array of booleans, where NA is treated as False. So you have options, but be sure to choose the correct one!
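A short sketch of label-based, position-based, and boolean access follows. The DataFrame, its row labels, and the Codons column are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with abbreviations as row labels
aminoacid_df = pd.DataFrame(
    {"Aminoacid": ["Alanine", "Cysteine", "Glycine", "Serine"],
     "Codons": [4, 2, 4, 6]},
    index=["Ala", "Cys", "Gly", "Ser"],
)

# .loc works with labels; a label slice includes its end point
by_label = aminoacid_df.loc[["Ala", "Gly"], "Codons"]
label_slice = aminoacid_df.loc["Ala":"Gly"]

# .iloc works with integer positions; a position slice excludes its end point
by_position = aminoacid_df.iloc[[3, 0], 1]
position_slice = aminoacid_df.iloc[1:3]

# A boolean Series keeps only the rows where the value is True
more_than_two = aminoacid_df["Codons"] > 2
filtered = aminoacid_df.loc[more_than_two]

print(list(label_slice.index))      # ['Ala', 'Cys', 'Gly']
print(list(position_slice.index))   # ['Cys', 'Gly']
print(list(filtered.index))         # ['Ala', 'Gly', 'Ser']
```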

5.5.4 Is it a View or a Copy?

Okay, it appears that we can use the [] to do almost anything but beware! We've been playing around a lot by pulling out sub-portions of our data frame. There is a lot happening under the hood but remember that we are dealing with objects! Depending on how we ask a Pandas object for access to its data, it may be returning a view of the object or a copy!

In the above example we used .loc[row, col] notation to access our DataFrame but we also used .loc[row][col]. We've seen this before when accessing arrays as well: the ability to subset in two ways.

In the first case, we are calling on a method and passing two parameters, row and column, to the method. For Python and the DataFrame object, it all happens in a single step by essentially going to that reference and pulling out what we want using the DataFrame's internal methods.

In the second case, we are chain-indexing. Depending on a package's implementation this can give you very different results! In the case of a DataFrame, we are asking Python to first retrieve just the rows of the DataFrame we want. When that object is returned it needs to be assigned a temporary place in Python memory. At this point, it could become a separate object or entity from the original - a shallow copy! After that, we are then subsetting by col in a completely different command (by Python's viewpoint). If we were using these commands to set values in our DataFrame we could be setting them in an object that simply disappears!

Simply put, unless you have good reason to, with Pandas objects such as a DataFrame, avoid chain-index notation!
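When setting a value, the single-step form is the safer pattern. The sketch below uses a made-up DataFrame with one deliberately wrong entry to correct; the chained form is left as a comment because its outcome depends on the Pandas version and on whether a copy was made along the way.

```python
import pandas as pd

# Hypothetical DataFrame; the Cys entry is deliberately wrong
aminoacid_df = pd.DataFrame(
    {"Aminoacid": ["Alanine", "Cysteine", "Glycine"],
     "Codons": [4, 0, 4]},
    index=["Ala", "Cys", "Gly"],
)

# Single step: row and column are handed to .loc together,
# so the assignment lands in aminoacid_df itself
aminoacid_df.loc["Cys", "Codons"] = 2

# Chained indexing (avoid): the first [] may return a temporary copy,
# and the assignment could then be lost
# aminoacid_df.loc["Cys"]["Codons"] = 2

print(aminoacid_df.loc["Cys", "Codons"])   # 2
```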

Read more: For some deeper examples of this problem, check out the Pandas documentation on this specific issue.

5.6.0 Subset DataFrame objects by their attributes

That's right, although there are some limitations on just how well this can work, you can treat the columns of your DataFrame like an accessible property. You can do similar with a Series object.

You'll run into problems and errors if you've used reserved keywords, other Pandas package-specific names, or if you break valid variable-naming convention but other than that, it is possible to use .column to access by non-numeric labels.
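Here is a sketch of attribute-style access and one of its pitfalls. The DataFrame is made up, and the column name count was chosen on purpose because it clashes with the existing DataFrame.count() method.

```python
import pandas as pd

# Hypothetical DataFrame; "count" collides with a DataFrame method name
aminoacid_df = pd.DataFrame({
    "Aminoacid": ["Alanine", "Cysteine", "Glycine"],
    "count": [4, 2, 4],
})

# Works: the column name is a valid, unused Python name
same_column = aminoacid_df.Aminoacid.equals(aminoacid_df["Aminoacid"])
print(same_column)                    # True

# Pitfall: .count is the DataFrame method, not our column
print(callable(aminoacid_df.count))   # True

# Bracket notation always reaches the column
print(aminoacid_df["count"].tolist())   # [4, 2, 4]
```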

Hint: Use tab-completion to see attributes of an object!


5.6.1 Challenge 1

Use .iloc method to replicate the output of:


5.7.0 Broadcasting values to multiple recipients

Data frames also support broadcasting - the simultaneous transmission of the same message to multiple recipients. This general definition of broadcasting is, at least at this point, more informative than its computer-science technical definition. Broadcasting is a convenient and efficient way to create and populate (add data to) columns and rows. NumPy arrays also support broadcasting, which we've kind of seen before when adding or multiplying across np.array objects. Plain Python lists do not: multiplying a list by a number repeats the list rather than multiplying each element.

Here is an example of broadcasting NaNs across a DataFrame. We'll also introduce the id() function which gives us the unique integer identification of an object.

Let's broadcast an entire new column ("broadcasted column") to aminoacids_csv
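Broadcasting a single value to every row might look like the sketch below. The DataFrame is a made-up stand-in for the imported amino acid data; the column names other than "broadcasted column" are hypothetical.

```python
import numpy as np
import pandas as pd

# NumPy: the scalar 10 is broadcast to every element
print(np.array([1, 2, 3]) + 10)     # [11 12 13]

# Hypothetical stand-in for the imported amino acid DataFrame
aminoacids_df = pd.DataFrame(
    {"Aminoacid": ["Alanine", "Cysteine", "Glycine"],
     "Codons": [4, 2, 4]},
    index=["Ala", "Cys", "Gly"],
)

# One value on the right-hand side fills every row of the new column
aminoacids_df["broadcasted column"] = "same value"

# Arithmetic with a scalar is broadcast across the whole column
aminoacids_df["Codons_doubled"] = aminoacids_df["Codons"] * 2

# NaN can be broadcast the same way to create an empty column
aminoacids_df["empty"] = np.nan

print(aminoacids_df)
```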


5.8.0 Missing Data

A common occurrence during data collection or generation is that some values could not be recorded, and this can happen for a variety of reasons: The equipment malfunctioned, some entries were deleted by mistake, or data was simply not available for patient A on a given day. These events lead to data gaps, and it is a critical issue that needs to be properly handled in order to get reliable insights about your datasets. Missing values are represented by NA (not available) and/or NaN (not a number).

Let's add some random missing values using the .reindex() method, which creates a copy of your DataFrame object with a new index and fills the rows with NaN for any index labels that did not previously exist.

Do we have any missing values?

Now, let's ask the opposite question: What observations are NaN?

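In the case of summations and means, Pandas skips NA/NaN values by default. For a sum this has the same effect as treating them as 0 (zero), but for a mean they are left out of both the total and the count. If all the observations from a variable are NA/NaN, the sum is 0 and the mean is NaN.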

Caution: Be careful, though: NA/NaN are not just simply equal to zero. Sometimes you can treat them as such but this is case-dependent, and it's ultimately you as a researcher/data analyst/data scientist who establishes the potential impact of those missing data in your analyses.
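The steps in this section can be tried on a tiny example. This is a sketch on made-up data: the Codons column and the extra row label "Xaa" are hypothetical.

```python
import math
import pandas as pd

# Hypothetical DataFrame
codon_df = pd.DataFrame({"Codons": [4, 2, 4]}, index=["Ala", "Cys", "Gly"])

# "Xaa" is not in the original index, so its row is filled with NaN
with_gaps_df = codon_df.reindex(["Ala", "Cys", "Gly", "Xaa"])

print(with_gaps_df.isna())                    # True only for the Xaa row
n_missing = with_gaps_df["Codons"].isna().sum()

# NaN is skipped by default
total = with_gaps_df["Codons"].sum()          # 4 + 2 + 4
average = with_gaps_df["Codons"].mean()       # 10 / 3, not 10 / 4

# skipna=False lets the missing value propagate instead
strict_total = with_gaps_df["Codons"].sum(skipna=False)

print(n_missing, total, average, strict_total)
```

The mean is computed over the three recorded values only, which is why skipping a NaN is not the same as counting it as zero.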

6.0.0 Class summary

That's our second class on Python! You've made it through and we've learned about a lot of data structures:

  1. Lists
  2. Dictionaries
  3. Tuples
  4. Arrays
  5. Series and DataFrames

6.0.1 A cheatsheet for data structures

| Structure | Package | Description | Initializer(s) | Indexed? | Mutable? | Nestable? |
|---|---|---|---|---|---|---|
| List | Python core | A 1D container of elements that can be made of anything | list() or [value1, value2, ...] | Yes | Yes | Yes |
| Tuple | Python core | A 1D container of elements that can be made of anything | tuple() or (value1, value2, ...) | Yes | No | Yes |
| Dictionary | Python core | A container of key:value pairs where keys must be immutable | dict() or {key1:value1, key2:value2, ...} | No | keys = No; values = Yes | Yes |
| Array | NumPy | A multi-dimensional container of a single immutable data type | np.array() | Yes | Yes | NA |
| Series | Pandas | A 1D container of a single immutable data type | pd.Series() | Yes | size = No; values = Yes | No |
| DataFrame | Pandas | A 2D container of Series objects (columns) | pd.DataFrame(dict) | Yes | num rows = No*; num cols = Yes; values = Yes | Yes** |

6.1.0 Post-lecture assessment (12% of final grade)

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 2 (Python Lists, 1300 possible points) and 4 (NumPy, 1400 possible points) from the Introduction to Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 2025 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:

DataCamp.example.png

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, July 1st to submit your assignment. There is no lecture that week but assignments are still due.


6.2.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


7.0.0 Appendix

7.1.0 What happens when we slice a list?

Here's a small example of list slicing and what we get back. Remember we talked about getting a view or copy back in section 5.5.4?

Let's recall that every object is assigned an integer ID in Python. You can find this with the id() function. When we assign a variable to an object, Python is providing a link or reference back to the ID of that object. Python essentially uses the IDs to assign memory space to an object where its value can be stored. Python uses references to keep track of IDs so that when there are no more (0) references to an object ID, Python can reclaim the memory where its value is held and its ID to use for something else in the program.

Whenever we slice through a list, it will return a copy of the references to those elements. Rather than copying each element to a new space in memory (and object ID), it uses the reference to find where those individual objects are in memory. This can bring up some rather tricky concepts when we have nested lists that are made using other list object references and what happens to them.

Depending on how you have sliced your original list, you may be getting back a direct reference (ID) to the original object, or references for the elements of the original list!
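The example can be reproduced with the sketch below. The assignments are the ones walked through in sections 7.1.1 and 7.1.2; the print() calls and the `is` comparisons are additions for illustration.

```python
list_1 = ["genome", 20, 30.5, True]
list_2 = [list_1, "this", "that", "those"]

list_3 = list_2[0:3]       # copies 3 references, including the one to list_1
list_4 = list_1[0:4]       # copies references to the elements of list_1
list_5 = list_2[0][0:4]    # same as list_4, reached through list_2
list_6 = list_2[0]         # no slice: a direct reference to list_1
list_7 = list_2[:]         # copies all references held by list_2

# Change list_1 in place
list_1[0] = "genomic"

print(list_3)              # [['genomic', 20, 30.5, True], 'this', 'that']
print(list_4)              # ['genome', 20, 30.5, True]
print(list_5)              # ['genome', 20, 30.5, True]
print(list_6 is list_1)    # True: same object, same id()
print(list_4 is list_1)    # False: a separate list object
print(list_7[0] is list_1) # True: the nested reference was copied
```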

Let's take a look at what just happened so we can break down the code in a bit more detail.

7.1.1 Assigning and slicing our lists

  1. list_1 = ["genome", 20, 30.5, True]: we generate a list with 4 elements.

  2. list_2 = [list_1, "this", "that", "those"]: we generate a second list of 4 elements with the first element being a list itself

  3. list_3 = list_2[0:3]: make a list copying the first 3 references to list_2; this includes the reference to the list_1 object.

  4. list_4 = list_1[0:4]: make a list by copying references for the first 4 elements in list_1; we do not copy the reference to list_1 object itself.

  5. list_5 = list_2[0][0:4]: make a list by copying references for the first 4 elements in the first element of list_2; it should be very similar to list_4.

  6. list_6 = list_2[0]: copy the reference to the first element of list_2 which is the reference to the list_1 object.

  7. list_7 = list_2[:]: copy all of the references to the elements of list_2 which includes a reference to the list_1 object.

7.1.2 Make a change directly to list_1

  1. list_1[0] = "genomic": Now we've changed the first element of list_1. Any other objects directly referencing list_1 will reflect this change. Objects that merely reference the elements of list_1 will not.

7.1.3 What is the resulting output?

Here's a summary of the list objects and their theoretical output

| Object | Direct reference to list_1? | Example object ID | Contents at assignment | Contents after changing list_1 |
|---|---|---|---|---|
| list_1 | Yes, of course! | 1 | ['genome', 20, 30.5, True] | ['genomic', 20, 30.5, True] |
| list_2 | Yes, at index = 0 | 2 | [['genome', 20, 30.5, True], 'this', 'that', 'those'] | [['genomic', 20, 30.5, True], 'this', 'that', 'those'] |
| list_3 | Yes, at index = 0 | 3 | [['genome', 20, 30.5, True], 'this', 'that'] | [['genomic', 20, 30.5, True], 'this', 'that'] |
| list_4 | No | 4 | ['genome', 20, 30.5, True] | ['genome', 20, 30.5, True] |
| list_5 | No | 5 | ['genome', 20, 30.5, True] | ['genome', 20, 30.5, True] |
| list_6 | Yes, a direct reference | 1 | ['genome', 20, 30.5, True] | ['genomic', 20, 30.5, True] |
| list_7 | Yes, at index = 0 | 6 | [['genome', 20, 30.5, True], 'this', 'that', 'those'] | [['genomic', 20, 30.5, True], 'this', 'that', 'those'] |

7.1.4 Know when you are viewing or copying when slicing

So we've taken a close look at lists and how they handle slicing. Remember that to avoid getting a direct reference (view) for a simple list object when using the = assignment operator, you can use [:] or the .copy() method to copy all of the element references. When working with nested lists which contain references to other lists, however, beware of what you'll get.

From our above examples, what happens if we assigned list_6[0] = "GENOMIC". Can you guess?


7.2.0 What happens when we slice a NumPy array?

Now that we've covered lists and their potential complications, should we take a closer look at NumPy arrays? Similar in concept to lists, recall that these objects are implemented in a package independent of the Python kernel development.


7.2.1 Slicing an array with [ ] returns a reference to the original array

So, the way arrays handle slicing is different from how lists handle the same commands. From the simple example above, we see that when you slice an array with basic [] slice notation you are returned a reference or view of that array rather than a copy. (Indexing with a list of integers or a boolean array is the exception: it returns a copy.) That being said, altering an array slice's reference can be quirky.

  1. array_4[0,0] = 50 assigns through index notation directly into the view, and so the change is propagated to array_1 and array_2.
  2. array_3 = array_3 * 2 does not use any slicing notation. Instead, the multiplication creates a new array holding the result, which is then assigned to the variable array_3. The reference that array_3 had to array_1 is released and replaced with one for this new object.

These are intentional behaviours of the array object. If a reference to an array is not what you want, then you can use the .copy() method to generate an independent copy of the array. Any changes in the original will not be propagated to the copy and vice versa!
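The behaviours above can be reproduced with a small sketch. The way array_1 to array_4 are built here is an assumption made to match the description; the arrays in the lecture example may have been created differently, and array_5 is an extra name added to show .copy().

```python
import numpy as np

array_1 = np.array([[1, 2],
                    [3, 4]])
array_2 = array_1            # a second name for the same object
array_3 = array_1[:, :]      # a slice: a view on array_1's data
array_4 = array_1[0:1, :]    # a slice: a view on the first row

# Assigning through the view changes the shared data
array_4[0, 0] = 50
print(array_1[0, 0], array_2[0, 0], array_3[0, 0])   # 50 50 50
print(np.shares_memory(array_1, array_4))            # True

# The multiplication builds a new array; array_3 now points to it
array_3 = array_3 * 2
print(array_1[0, 0], array_3[0, 0])                  # 50 100

# .copy() gives an array that no longer shares data with array_1
array_5 = array_1.copy()
array_5[1, 1] = -1
print(array_1[1, 1])                                 # 4
```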

What do you think array_3[:,:] = array_3 * 2 would do?

The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png